Another openly accessible source is the “Scuola in Chiaro” portal, through which it is possible to browse individual schools. The information displayed on this portal comes both from the aforementioned MIUR Open Data and from additional data that schools can provide through another database controlled by the MIUR, namely the Information System of Education (SIDI).
All Italian schools are identified by a unique 10-character ID called the mechanographical code. Its first two characters denote the province and the following two characters identify the type of school.
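As a minimal sketch of this structure, the snippet below splits a code into its first two fields using base R. The helper `parse_mech_code` is hypothetical and not part of the original analysis; the code `BAIS033007` is the reference institute used as an example in this section.

```r
# Hypothetical helper: split a mechanographical code into its leading fields.
parse_mech_code <- function(code) {
  code <- toupper(trimws(code))
  # The code is described in the text as a 10-character ID
  stopifnot(all(nchar(code) == 10))
  data.frame(
    code        = code,
    province    = substr(code, 1, 2),  # characters 1-2: province
    school_type = substr(code, 3, 4),  # characters 3-4: type of school
    stringsAsFactors = FALSE
  )
}

parsed <- parse_mech_code("BAIS033007")
print(parsed)
```

The remaining six characters are left untouched here, since the text does not describe their meaning.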
The school anagraphics database (available here) includes both public and private schools, but we consider only the former because the school buildings dataset (next section) does not include private schools. Data are detailed at the level of single schools, and each school is mapped to its reference institute (i.e. a single institute may serve more than one venue). Each school is in turn located in a physical building. Building IDs are taken from the school buildings dataset, which has been joined to this one via the schools’ mechanographical code.
Here we show an example of the hierarchy among reference institute, school, and school building. A single school may be made up of several distinct buildings, and one building may host more than one school. The top node is the reference institute (BAIS033007), a Superior Institute in this case; the intermediate nodes are the single schools identified via the mechanographical code; the bottom nodes are the school buildings. Since a school building is a physical object, it is natural to attach a spatial location to it: the initial digits of the building ID are the code of the municipality in which the building is located. In this example, two schools are located in the municipality of Casamassima (BA, municipality code = 72015) and five other schools are distributed among three buildings located in the municipality of Acquaviva delle Fonti (BA, municipality code = 72001). To summarise, the reference institute heads 7 different high schools located in a total of 4 buildings.
example.scheme()
As can be seen, there is no one-to-one relationship between schools and physical buildings. In order to have as many observed territorial units as possible, the target variable (i.e. the Invalsi score) has been detailed at the level of the municipality in which buildings are located rather than at the level of the venue of the reference institute. Therefore we only need to aggregate the school buildings dataset at the municipality level (for province-level data the issue is simpler, since data are observed in each province and all school buildings belong to the same province as the reference institute).
Here
the link between the mechanographical code and the education grade is
briefly summarised. A major issue arises when it is not possible to
assign a single school to a given education grade (i.e. primary school,
middle school and high school) by means of the mechanographical codes.
This happens for two reasons. On the one hand, some types of schools
do not follow the same coding rules as ordinary ones (e.g. religious
schools, which are outside the scope of our analysis, so the
problem does not arise for them). On the other hand, single schools
classified as Comprehensive Institutes (code IC, henceforth referred to
by this code) cannot be linked to a specific education grade.
In
addition, there may be a duplication issue: Comprehensive Institutes (as
reference institutes) include a number of different schools belonging to
different grades. For each set of schools belonging to one reference
Comprehensive Institute, besides the single schools classified in detail
as primary, middle, or high schools of a specific type (e.g. scientific
or classical high schools, art institutes, technology institutes,
professional institutes), one may also find a single school
classified as a Comprehensive Institute. In this case no education grade
can be assigned to the school, nor is there any information about
whether the mechanographical code identifies a standalone school with
its own building or whether the record is simply a duplicate.
Hence the problem with ICs is twofold:
Most schools classified as ICs have a physical building ID that appears more than once (e.g. 4,816 out of 5,103 for the school year 2021-2022), which means that in these cases the same information is repeated both for a properly identifiable school and for a school classified as IC:
# 2022 data
table(ifelse(ICs22$CODICE_EDIFICIO %in% doub.buildings22$CODICE_EDIFICIO, "Duplicated", "Unique") )
##
## Duplicated Unique
## 4816 287
# 2023 data
table(ifelse(ICs23$CODICE_EDIFICIO %in% doub.buildings23$CODICE_EDIFICIO, "Duplicated", "Unique") )
##
## Duplicated Unique
## 4837 289
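The objects `doub.buildings22` and `ICs22` are built elsewhere in the analysis. The sketch below shows one way such objects could be derived, on toy data. It assumes that a building counts as duplicated when its code occurs in more than one row, and that ICs can be selected through characters 3-4 of the school code. The column name `CODICE_SCUOLA` and all values are hypothetical.

```r
# Toy data: hypothetical school and building codes, not taken from the real DB.
schools <- data.frame(
  CODICE_SCUOLA   = c("BAIC000001", "BAEE000002", "BAMM000003", "BAIC000004"),
  CODICE_EDIFICIO = c("E1", "E1", "E2", "E3"),
  stringsAsFactors = FALSE
)

# Assumption: a building is "duplicated" when its code occurs in more than one row.
dup_codes <- unique(schools$CODICE_EDIFICIO[duplicated(schools$CODICE_EDIFICIO)])
doub_buildings <- schools[schools$CODICE_EDIFICIO %in% dup_codes, ]

# Assumption: ICs are the schools whose type field (characters 3-4) is "IC".
ICs <- schools[substr(schools$CODICE_SCUOLA, 3, 4) == "IC", ]

# Same classification as in the chunks above, applied to the toy data.
ic_status <- ifelse(ICs$CODICE_EDIFICIO %in% doub_buildings$CODICE_EDIFICIO,
                    "Duplicated", "Unique")
print(table(ic_status))
```

In this toy example the first IC shares building `E1` with another school, while the second IC is the only record for building `E3`.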
For certain ICs whose building code is unique within the DB, some physical variables still appear to be redundant with those of other buildings having that school as the reference institute (e.g. an IC has the same school volume as another building among the pertaining schools). Should this aspect be deemed worth inspecting, here we link to an interactive table displaying some physical variables (physical measures and address) of ICs whose building code is unique. The file must be opened through its Google Drive address for interactive selection to work. Please note that, as this is a completely open-access file, all data displayed in it are public-domain information, downloaded for free from openly accessible web pages.
A similar problem occurs with the category of Superior Institutes (ISs). ISs are often the reference institute for a number of high schools, hence schools with mechanographical codes including “IS” are certainly high schools; the only problem is the duplication issue we have just discussed for ICs.
Similarly to ICs, most ISs have a building code which appears also for other schools:
# 2022 Data
table(ifelse(ISs22$CODICE_EDIFICIO %in% doub.buildings22$CODICE_EDIFICIO, "Duplicated", "Unique") )
##
## Duplicated Unique
## 1731 155
# 2023 Data
table(ifelse(ISs23$CODICE_EDIFICIO %in% doub.buildings23$CODICE_EDIFICIO, "Duplicated", "Unique") )
##
## Duplicated Unique
## 1733 149
As for ICs, here it is possible to visualize all schools referring to a Superior Institute with a unique building code.
Given the ambiguity of these records, in the absence of further information about the physical nature of schools classified as ICs or ISs we have decided not to include them in the covariates DB which is discussed in the section below.
This is the main database of our analysis (link).
To the best of our knowledge this is by far the most important source of
information about the territorial distribution of school infrastructure.
It only includes public schools. Data are detailed at the level of
physical buildings and single schools, hence each row is a unique
combination of these two IDs. Most variables are Boolean.
Schools need to be grouped either by education grade and municipality
or by education grade and province. The major issue lies in the
imputation of undefined records, i.e. all responses coded as “NON
DEFINITO” (“undefined”). Additionally, since some variables have
a number of undefined observations on the order of \(10^3\) or \(10^4\), they cannot be used as
covariates in the forthcoming analysis and are therefore
ruled out. Since by default we delete all rows containing a
missing observation, we need to balance the row cut against the
column cut (e.g. without any column cut we would have to filter
out more than 20,000 rows, and vice versa). Hence the need for a
numerical threshold, which we have set at 1000 missing
observations for a column to be ruled out.
Moreover, we rule out
some variables that are of little interest or cannot be
aggregated at the province or municipality level. These same operations
may be repeated for any available year other than 2021-2022.
Only when we have solved each ambiguity caused by missingness can we
aggregate our records at the municipality or province level.
In the
Appendix we display how we have chosen to clean the dataset and
aggregate data.
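The cleaning and aggregation steps described above can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the procedure reported in the Appendix. The column names and the `"SI"`/`"NO"` coding are hypothetical. The threshold is read as "drop a column when it has at least that many undefined values", and it is lowered from 1000 to 2 to suit the five toy rows.

```r
# Toy data: hypothetical variable names and values, not the real buildings DB.
buildings <- data.frame(
  CODICE_COMUNE    = c("72001", "72001", "72015", "72015", "72015"),
  TRASPORTO_URBANO = c("SI", "NO", "SI", "NON DEFINITO", "SI"),
  SCUOLABUS        = c("NON DEFINITO", "NON DEFINITO", "SI", "NON DEFINITO", "NO"),
  stringsAsFactors = FALSE
)

max_undefined <- 2  # the text uses 1000 on the full dataset

# 1. Recode undefined responses as missing values.
buildings[buildings == "NON DEFINITO"] <- NA

# 2. Column cut: drop variables reaching the threshold of undefined values.
n_undefined <- colSums(is.na(buildings))
kept <- buildings[, n_undefined < max_undefined, drop = FALSE]

# 3. Row cut: delete every remaining row with a missing observation.
kept <- kept[complete.cases(kept), , drop = FALSE]

# 4. Aggregate a Boolean variable as a percentage by municipality.
kept$served <- kept$TRASPORTO_URBANO == "SI"
by_municipality <- aggregate(served ~ CODICE_COMUNE, data = kept,
                             FUN = function(x) 100 * mean(x))
names(by_municipality)[2] <- "pct_served"
print(by_municipality)
```

Raising the threshold keeps more columns but removes more rows in step 3, which is the trade-off the threshold is meant to balance.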
As an example of how the final dataframes are
structured, we show the percentage of high schools served by urban
public transport in the Apulia region for the school year 2021-2022.
Clearly, the areas shown as not available are the municipalities in which no high
school is located.
Map_covar1(Year = 2022, nfield = 31, level = "Municipality", region_code = 16, plot = "Mapview", pal = "Blues", col.rev = FALSE, type ="Superiore")
Here we provide a link to a complete mapping of all the variables included in the final dataframes:
| Year | Province | Municipality (Apulia region) |
|---|---|---|
| 2021-2022 | | |
| 2022-2023 | | |
The first three datasets are perhaps the most interesting ones because they can be directly linked to the main DB. All these DBs are detailed by school code and school year. However, while the combinations of school code and year are the same for DBs 1) and 3), DB 2) lacks 4,120 of these combinations.
Here we summarise the number of schools for which the number of students is available:
MIUR_StudNumAvailability %>% group_by(TIPO) %>%
  summarise(DISPONIBILE = sum(DISPONIBILE_NUMERO_STUDENTI),
            NON_DISPONIBILE = n() - sum(DISPONIBILE_NUMERO_STUDENTI))
As can be seen the number of students is completely unavailable for
Comprehensive Institutes and for Superior Institutes yet the frequency
of missing records for the clearly classified schools is relatively low
(4.79% for high schools, 1.83% for elementary schools, 3.59% for
elementary schools). In the absence of information provided by the MIUR
Open Data we have tried to resort to the “Scuola in Chiaro” portal, but
for most of these schools there is no relevant information on this site
either. As a consequence we just cannot include them in the covariates
file.
Among the five datasets listed above, we are going to employ
the dataset with the detail of the number of classes for each year.
As stated in previous sections, we ultimately need data at the municipality level (or province level, which is easier). However, the student number data are detailed only by school, with no information about the physical buildings, and some schools (98 schools in 2022, of which 52 are in the scope of this research) are served by several buildings located in different municipalities. Therefore we need a criterion to map each school unambiguously to one municipality. This can be accomplished by relying on the school anagraphics dataset (first section), in which each row corresponds to one school and the municipality of the main physical structure for that school is provided. Based on this information we have aggregated this dataset at the municipality and province level.
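The mapping just described amounts to a join on the school code followed by a grouped sum. A minimal sketch with base R is given below. All column names (`CODICE_SCUOLA`, `ALUNNI`, `CODICE_COMUNE`, `CODICE_PROVINCIA`) and all values are hypothetical, chosen only to illustrate the logic.

```r
# Toy data: hypothetical school codes, student counts and territorial codes.
students <- data.frame(
  CODICE_SCUOLA = c("S1", "S2", "S3"),
  ALUNNI        = c(120, 80, 200),
  stringsAsFactors = FALSE
)
anagraphics <- data.frame(
  CODICE_SCUOLA    = c("S1", "S2", "S3"),
  CODICE_COMUNE    = c("72001", "72001", "72015"),  # municipality of the main structure
  CODICE_PROVINCIA = c("72", "72", "72"),
  stringsAsFactors = FALSE
)

# One row per school in the anagraphics table, so each school
# receives exactly one municipality.
merged <- merge(students, anagraphics, by = "CODICE_SCUOLA", all.x = TRUE)

# Totals by municipality and by province.
by_municipality <- aggregate(ALUNNI ~ CODICE_COMUNE, data = merged, FUN = sum)
by_province     <- aggregate(ALUNNI ~ CODICE_PROVINCIA, data = merged, FUN = sum)
print(by_municipality)
print(by_province)
```

Because the anagraphics table carries a single municipality per school, a school served by buildings in several municipalities is counted only in the municipality of its main structure.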
This dataset reports the number of teachers, both tenured and substitute, by age class and education grade (i.e. kindergarten, primary school, middle school and high school). Data are detailed only at the province level, therefore it is impossible to map this information to municipalities or single schools. The following map shows the number of teachers per student across the Italian provinces in the school year 2021-2022:
Prov_shp %>% rename(CODICE_PROVINCIA = COD_PROV) %>% dplyr::select(CODICE_PROVINCIA) %>%
  left_join(filter(Docenti_per_alunno_statale_2022, ORDINE_SCUOLA == "SCUOLA SECONDARIA II GRADO"),
            by = "CODICE_PROVINCIA") %>%
  mapview(zcol = "Docenti_per_alunno", popup = paste0(set.popup.height(220), leafpop::popupTable(.)),
          col.regions = grDevices::hcl.colors(nrow(.)-3, palette = "Blues"))